An Empirical Determination of Samples for Decision Trees
Author
Abstract
Because there is no known way to determine a proper sample size for data mining tasks, choosing a sample size for decision trees, which are among the best data mining algorithms, is arbitrary. As the sample grows, the generated decision tree grows too, with some improvement in error rate. But we cannot keep using larger and larger samples, because large decision trees are hard to understand and overfitting can occur when the target data set is limited. This paper suggests an objective approach to determining a proper sample size for generating good decision trees with respect to tree size and error rate. Experiments with two representative decision tree algorithms, CART and C4.5, show very promising results.
Key-Words: decision trees, proper sample size determination
Similar resources
Predicting The Type of Malaria Using Classification and Regression Decision Trees
Maryam Ashoori (1, *), Fatemeh Hamzavi (2). (1) School of Technical and Engineering, Higher Educational Complex of Saravan, Saravan, Iran. (2) School of Agriculture, Higher Educational Complex of Saravan, Saravan, Iran. Abstract. Background: Malaria is an infectious disease infecting 200-300 million people annually. Environme...
Determine the most suitable Allometric equations for Estimating Above-ground Biomass of the Juniperus excelsa
Today, modeling and determining allometric equations for forest trees, especially juniper trees, is very important for assessing the biological status and carbon storage capacity of forest species. The aim of this study was to determine the most suitable allometric equations for estimating the biomass of leaf, sub branch, main branch, trunk, and biomass of total Juniperus excelsa tr...
Effectiveness of the Self-determination Educational Package on Self-directed Learning and Decision-making Styles among High School Students
Introduction: The purpose of this study was to develop a self-determination educational package and determine its effectiveness on Self-Directed Learning and Decision making Styles of high school students. Methods: The research method was semi-experimental with pre-test, post-test with the control group and follow up. At first, self-determination educational package was compiled using library s...
Comparison of Ordinal Response Modeling Methods like Decision Trees, Ordinal Forest and L1 Penalized Continuation Ratio Regression in High Dimensional Data
Background: Response variables in most medical and health-related research have an ordinal nature. Conventional modeling methods assume predictor variables to be independent, and consider a large number of samples (n) compared to the number of covariates (p). Therefore, it is not possible to use conventional models for high dimensional genetic data in which p > n. The present study compared th...
Instance sampling in credit scoring: An empirical study of sample size and balancing
To date, best practice in sampling credit applicants has been established based largely on expert opinion, which generally recommends that small samples of 1500 instances each of both goods and bads are sufficient, and that the heavily biased datasets observed should be balanced by undersampling the majority class. Consequently, the topics of sample sizes and sample balance have not been subject...
Journal title:
Volume, issue:
Pages: -
Publication date: 2009